The information conveyed by a hierarchical attractor neural network is examined. The network
learns sets of correlated patterns (the examples) in the lowest level of the hierarchical tree and can
categorize them at the upper levels. A way to measure the non-extensive information content of the
examples is formulated. Curves showing the transition from a large retrieval information to a large
categorization information, as the number of examples increases, are displayed. The
conditions for maximal information are given as functions of the correlation between examples
and the load of concepts. Numerical simulations support the analytical results.

PACS numbers: 87.10.+e, 64.60.Cn, 02.50.-r



I. INTRODUCTION 



In the context of learning rules by perceptrons, generalization by a neural network is the
capability to correctly classify new patterns after some examples have been taught to
the network (see, e.g., [1]). For attractor neural networks, another type of generalization
was suggested, categorization, which emerges from an encoding stage where
a hierarchical tree of patterns is stored [?]. The ability of the network to classify the
patterns on a lower level of the tree (i.e., the examples) into categories defined
by their ancestors (i.e., the concepts) arises in the Hopfield model if the examples are
correlated with their concepts [?].

A minimal number $S$ of examples for each concept is necessary to start the categorization.
An extensive number of concepts is then learned by memorizing finite sets of examples.
This was shown for networks of binary neurons with fully-connected [?], diluted [?] or
layered [?] architectures, and for analog [?], ternary [?], and non-monotonic [10] neurons,
using Hebbian synapses. Similar behavior was found for pseudo-inverse synapses [?].
Categorization is achieved through the appearance of symmetric spurious states. This
ability to categorize starts just when the capacity of the network to recover the original
examples is lost, because of the interference generated by their correlations.

As in most models for pattern recognition, an adequate analysis of the memory capacity of
these networks requires the tools of information theory. In the case of unbiased
independent patterns, one can avoid it and measure the performance through the Hamming
distance $D$ between the neuron states and the retrieved pattern, and the load capacity
$\alpha$. One scenario where $D$ and $\alpha$ are not enough to characterize the system is
that of sparse coded patterns [?]. Another is that of dependent patterns. This is the case
for categorization models, since the information conveyed by the examples is not extensive
in them.



Our goal in this work is to establish a reliable measure, based on information theory, for
the capacity to retrieve examples and to categorize them. In the next section, we define
the model and its parameters. After obtaining the expressions for the information capacity
in Section III, we study in Section IV some special cases which present the transition from
a retrieval to a categorization phase. Finally, we conclude with some remarks in Section V.



II. THE MODEL 

Consider a network of $N$ binary neurons, with states $\{\sigma_{it} = \pm 1\}_{i=1}^{N}$
at time $t$. The neuron states are updated in parallel according to the deterministic rule

$$ \sigma_{i,t+1} = \mathrm{sign}(h_{it}), \qquad h_{it} = \sum_{j=1}^{N} J_{ij}\,\sigma_{jt}, \tag{1} $$



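where $h_{it}$ is the local field of neuron $i$ at time $t$. The elements of the
Hebbian-like synaptic matrix between neurons $i$ and $j$ are given by

$$ J_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \sum_{\rho=1}^{S} \eta_i^{\mu\rho}\,\eta_j^{\mu\rho}, \tag{2} $$

where $\{\eta_i^{\mu\rho}\}_{\rho=1}^{S}$ are the examples of the concept $\xi_i^{\mu}$. The
concepts are independent identically distributed random variables (IIDRV),
$\{\xi_i^{\mu} = \pm 1\}_{\mu=1}^{p}$, with equal probability.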

In the encoding stage, the examples are built from the concepts, according to the
stochastic process:

$$ p(\eta_i^{\mu\rho}|\xi_i^{\mu}) = b\,\delta(\eta_i^{\mu\rho} - \xi_i^{\mu}) + (1 - b)\,\delta(|\eta_i^{\mu\rho}|^2 - 1), \tag{3} $$

where $b = \langle \eta_i^{\mu\rho}\xi_i^{\mu} \rangle$ gives the correlation between the
ancestors (the concepts) and the descendants (the examples) of this tree of patterns. The
second delta of this conditional distribution gives the component of the examples which is
independent of the concepts. This process can equivalently be formulated as
$\eta_i^{\mu\rho} = \xi_i^{\mu}\lambda_i^{\mu\rho}$, where the biased IIDRV
$\lambda_i^{\mu\rho}$ are distributed according to

$$ p_b(\lambda) = B_+\,\delta(\lambda - 1) + B_-\,\delta(\lambda + 1), \tag{4} $$

with $B_\pm = (1 \pm b)/2$.
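The encoding stage can be illustrated with a short numerical sketch. It assumes the product form of the process, in which each example is a concept multiplied by a biased sign that equals +1 with probability $(1+b)/2$; the function name `make_patterns`, the array layout and the chosen sizes are hypothetical and only serve the illustration.

```python
import numpy as np

def make_patterns(N, p, S, b, seed=0):
    """Draw p concepts and S examples per concept on N neurons (illustrative layout)."""
    rng = np.random.default_rng(seed)
    # Concepts: unbiased +/-1 variables, shape (p, N).
    xi = rng.choice([-1, 1], size=(p, N))
    # Biased signs: +1 with probability B_+ = (1 + b) / 2, else -1, shape (p, S, N).
    lam = np.where(rng.random((p, S, N)) < (1 + b) / 2, 1, -1)
    # Examples: concept times biased sign, shape (p, S, N).
    eta = xi[:, None, :] * lam
    return xi, eta

xi, eta = make_patterns(N=20000, p=3, S=4, b=0.3)
# Empirical example-concept correlation, expected to be close to b.
corr_concept = np.mean(eta * xi[:, None, :])
# Empirical correlation between two examples of the same concept, expected close to b^2.
corr_examples = np.mean(eta[:, 0, :] * eta[:, 1, :])
print(corr_concept, corr_examples)
```

Two examples of the same concept are correlated only through their common ancestor, which is why their mutual overlap is of order $b^2$ rather than $b$.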

The macroscopic parameters which describe the state of the network are the retrieval and
the categorization overlaps, respectively:

$$ m_t^{\mu\rho} = \frac{1}{N} \sum_{i=1}^{N} \eta_i^{\mu\rho}\,\sigma_{it}, \qquad M_t^{\mu} = \frac{1}{N} \sum_{i=1}^{N} \xi_i^{\mu}\,\sigma_{it}. \tag{5} $$

In the thermodynamic limit, the qualities of the retrieval and of the categorization can be
measured by taking the limit $N \to \infty$ of the overlaps for a single concept, say
$\mu = 1$, which gives

$$ m_t^{1\rho} = \langle \eta^{1\rho}\,\mathrm{sign}[h_{t-1}] \rangle, \qquad M_t^{1} = \langle \xi^{1}\,\mathrm{sign}[h_{t-1}] \rangle, \tag{6} $$

where the brackets mean averages over the set of examples $\eta^{1\rho}$ and the local
field $h_{t-1}$ for a single neuron.

The generalization error [?, ?] can be defined as
$E_t^{1} = \tfrac{1}{2}\langle |\sigma_t - \xi^{1}|^2 \rangle = 1 - M_t^{1}$, as a function
of the categorization overlap. The stationary states are given by macroscopic overlaps with
examples of a given concept, say $m_t^{1\rho} = m^{1\rho}$, and microscopic remaining
overlaps, $m^{\nu\rho} \sim 1/\sqrt{N}$ for $\nu > 1$. The general solution is represented
by a retrieval overlap with a single example, say $m^{11} = m$, and the quasi-symmetric
overlaps with the other examples, $m^{1\rho} = m_s$, $\rho > 1$. In the retrieval phase one
has $m \sim 1$, $m_s \sim b^2$, while in the categorization phase the stable state is
$m = m_s \sim b$, which may lead to a large categorization overlap, $M^{1} = M \sim 1$. In
the following we consider a situation where the network has relaxed to the equilibrium
states, so we can drop the time $t$ from the parameters.
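A minimal sketch of the model on a small network may help fix ideas: it stores the examples in a Hebbian matrix, runs the parallel sign dynamics from one example, and measures the retrieval and categorization overlaps. The $1/N$ normalization of the couplings, the inclusion of self-couplings, the tie-breaking of `sign(0)` to +1, and all names and sizes (`simulate`, `N=1000`, `p=2`, `S=3`, `b=0.5`) are assumptions made for this illustration only.

```python
import numpy as np

def simulate(N=1000, p=2, S=3, b=0.5, steps=10, seed=0):
    """Store p*S correlated examples and relax the network from the first example."""
    rng = np.random.default_rng(seed)
    xi = rng.choice([-1, 1], size=(p, N))                        # concepts
    lam = np.where(rng.random((p, S, N)) < (1 + b) / 2, 1, -1)   # biased signs
    eta = xi[:, None, :] * lam                                   # examples, shape (p, S, N)
    flat = eta.reshape(p * S, N)
    # Hebbian-like couplings; the 1/N factor and the kept diagonal are assumptions.
    J = flat.T @ flat / N
    sigma = eta[0, 0].copy()                                     # initial state: first example
    for _ in range(steps):
        h = J @ sigma                                            # local fields
        sigma = np.where(h >= 0, 1, -1)                          # parallel deterministic update
    m = eta[0, 0] @ sigma / N                                    # retrieval overlap
    M = xi[0] @ sigma / N                                        # categorization overlap
    return m, M

m, M = simulate()
print(m, M)
```

With so few examples the initial example is expected to be a stable state, so the retrieval overlap stays close to 1 while the categorization overlap stays close to $b$. Increasing `S` in this sketch is the natural way to probe the categorization regime described in the text.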



III. INFORMATION CAPACITIES 

In this section we describe a way to measure the storage of information by the network in
the retrieving and categorizing regimes. There are two types of information to be extracted
from the patterns in these networks: the retrieval information and the categorization
information. The former is that which can be conveyed from the examples to the neurons,
while the latter is that which can be conveyed from the concepts. In each case one must
calculate the information entropy of the pattern distributions,
$H[\{\xi_i^{\mu}\}] = -\sum_{\{\xi\}} p(\{\xi_i^{\mu}\}) \log[p(\{\xi_i^{\mu}\})]$ and
$H[\{\eta_i^{\mu\rho}\}] = -\sum_{\{\eta\}} p(\{\eta_i^{\mu\rho}\}) \log[p(\{\eta_i^{\mu\rho}\})]$,
where $p(\{\xi_i^{\mu}\})$ and $p(\{\eta_i^{\mu\rho}\})$ are the joint probability
distributions of the concepts and of the examples, respectively.



The categorization information can be easily measured by computing the categorization
overlap of a single concept, $M$, and its entropy. Since the concepts
$\{\xi_i^{\mu}\}_{i,\mu}^{N,p}$ are IIDRV, their probability distribution factorizes,
$p(\{\xi_i^{\mu}\}) = \prod_{i,\mu}^{N,p} p(\xi_i^{\mu})$. Thus the entropy of the concepts
is extensive, $H[\{\xi_i^{\mu}\}] = \sum_{i,\mu}^{N,p} H[\xi_i^{\mu}] = pN\,H[\xi]$, where
the entropy of a single concept on a single neuron is $H[\xi] = \log(2)$. As we study
binary patterns, we use base-2 logarithms in order to count information in bits, so that
$H[\xi] = 1$. The equivocation in the categorization can be evaluated by the square of the
overlap, in such a way that no information is transmitted by the concepts if $M = 0$ and
the information is maximal if $M = \pm 1$. Note that the information is symmetric in this
overlap, because an inverted state $\sigma_i = -\xi_i$ carries the same information as
$\sigma_i = \xi_i$. Therefore, the total categorization information is
$I_C = pN M^2 H[\xi]$, and the categorization information (per synapse) is

$$ i_C = \alpha M^2. \tag{7} $$

The retrieval information can be similarly measured, by computing the retrieval overlap and
the entropy of the examples, since the joint distribution of the examples also factorizes
over neurons and concepts,
$p(\{\eta_i^{\mu\rho}\}) = \prod_{i,\mu}^{N,p} p(\{\eta_i^{\mu\rho}\}_{\rho}^{S})$, such that
the entropy is extensive in the concepts and in the neurons,
$H[\{\eta_i^{\mu\rho}\}] = \sum_{i,\mu}^{N,p} H[\{\eta_i^{\mu\rho}\}_{\rho}^{S}] = pN\,H[\{\eta^{\rho}\}_{\rho}^{S}]$.
Thus it is enough to calculate the entropy of a set of examples of a single concept,
$\{\eta^{\rho}\}_{\rho} \equiv \{\eta_i^{\mu\rho}\}_{\rho}$, on a single neuron, to get the
entropy of the whole set $\{\eta_i^{\mu\rho}\}$.

On the other hand, $\{\eta^{\rho}\}_{\rho}$ is not a set of IIDRV, so
$p(\{\eta^{\rho}\}_{\rho})$ does not factorize into example probabilities, and the entropy
is not extensive in the examples,
$H[\{\eta^{\rho}\}_{\rho}^{S}] \neq \sum_{\rho}^{S} H[\eta^{\rho}]$. So the retrieval
information is not the naive one, $i_R \neq \alpha S$.

Let $\{\eta^{\rho}\} \equiv \{\eta^{\rho}\}_{\rho=1}^{S}$ be a set of examples of a given
concept on a given neuron. In calculating $p(\{\eta^{\rho}\})$ we proceed as follows: we
take the conditional probability of the examples given the concept,
$p(\{\eta^{\rho}\}|\xi)$, from Eq. (3), and average it over the distribution of $\xi$,

$$ p(\{\eta^{\rho}\}) = \langle p(\{\eta^{\rho}\}|\xi) \rangle_{\xi} = \frac{1}{2} \left[ \prod_{\rho=1}^{S} p_b(\eta^{\rho}) + \prod_{\rho=1}^{S} p_b(-\eta^{\rho}) \right], \tag{8} $$

where $p_b$ is the probability distribution in Eq. (4). After expanding this product, we
calculate the entropy of this distribution, obtaining

$$ H[\{\eta^{\rho}\}] = -\sum_{k=0}^{S} C_k^{S} A_k \log(A_k), \qquad A_k = [B_+^{k} B_-^{S-k} + B_-^{k} B_+^{S-k}]/2, \tag{9} $$

where $C_k^{S}$ are the combinatorial numbers.
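The entropy of a set of examples can be checked numerically. The sketch below evaluates the closed form of Eq. (9) in bits and compares it with a brute-force sum over all $2^S$ example configurations, where each configuration's probability is the average over $\xi = \pm 1$ of the conditional product distribution. The function names are hypothetical, and the base-2 logarithm follows the convention stated earlier for counting bits.

```python
import math
from itertools import product

def example_entropy(S, b):
    """Entropy (bits) of S examples of one concept on one neuron, closed form."""
    Bp, Bm = (1 + b) / 2, (1 - b) / 2
    H = 0.0
    for k in range(S + 1):
        # Probability of one configuration with k examples equal to +1.
        A = 0.5 * (Bp**k * Bm**(S - k) + Bm**k * Bp**(S - k))
        if A > 0:
            H -= math.comb(S, k) * A * math.log2(A)
    return H

def example_entropy_bruteforce(S, b):
    """Same entropy, summing explicitly over all 2^S configurations."""
    Bp, Bm = (1 + b) / 2, (1 - b) / 2
    H = 0.0
    for etas in product([-1, 1], repeat=S):
        prob = 0.0
        for xi in (-1, 1):                      # average over the concept value
            cond = 1.0
            for e in etas:
                cond *= Bp if e == xi else Bm   # example agrees with concept w.p. B_+
            prob += 0.5 * cond
        if prob > 0:
            H -= prob * math.log2(prob)
    return H

print(example_entropy(7, 0.3), example_entropy_bruteforce(7, 0.3))
```

In the limiting cases the function behaves as expected: independent examples ($b = 0$) give $S$ bits, while fully correlated examples ($b = 1$) give a single bit regardless of $S$.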

In evaluating the equivocation in the retrieval, we have to multiply this entropy by the
square of the retrieval overlap of a single example. Since we have to subtract the
information due to the categorization, and the overlaps between examples and their concepts
are $b = \langle \eta^{\rho}\xi \rangle$, we estimate the total retrieval information as
$I_R = pN (m - bM)^2 H[\{\eta^{\rho}\}]$. Therefore the retrieval information (per synapse)
is

$$ i_R = \alpha (m - bM)^2 H[\{\eta^{\rho}\}]. \tag{10} $$



Although other measures for the information could be used, they must be monotonic functions
of those we consider in Eqs. (7) and (10). These have, nevertheless, the advantage that
both are equivalently scaled and can be directly compared to each other.
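The two information measures per synapse are simple enough to state as code. The sketch below assumes the forms $i_C = \alpha M^2$ and $i_R = \alpha (m - bM)^2 H$, with the entropy of the example set passed in as a number `H_examples` (in bits); the function names and the sample values are illustrative only.

```python
def categorization_information(alpha, M):
    """Categorization information per synapse: alpha * M^2."""
    return alpha * M**2

def retrieval_information(alpha, m, M, b, H_examples):
    """Retrieval information per synapse: alpha * (m - b*M)^2 * H_examples."""
    return alpha * (m - b * M) ** 2 * H_examples

# Perfect categorization at load alpha = 0.01 gives the saturation value alpha itself.
print(categorization_information(0.01, 1.0))

# Uncorrelated examples (b = 0, so H_examples = S bits) retrieved perfectly (m = 1, M = 0)
# reduce to the naive estimate alpha * S.
print(retrieval_information(0.04, 1.0, 0.0, 0.0, 2.0))
```

Because both quantities carry the same factor $\alpha$ and are measured in bits per synapse, they can be plotted on a common axis.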



IV. RESULTS 

We now present the equilibrium states of the networks, which are used to obtain the
retrieval and categorization information. These states are studied for two systems: an
asymptotic network ($N \to \infty$), for which analytical stationary equations were
derived [?], and finite-sized systems, for which simulations of the dynamics in Eq. (1) are
carried out. While the information measures obtained in the previous section are functions
of the asymptotic parameters $M$ and $m$, the results from simulation use the overlaps in
Eq. (5).



A. Asymptotic network 

First we study the stationary states of the overlaps in Eq. (6), in the thermodynamic limit
$N \to \infty$. Using the Hebbian synapses in Eq. (2) in the dynamics in Eq. (1), taking
the local field at the fixed point, and averaging over the distribution of a single
example, one gets:

$$
\begin{aligned}
m &= \sum_{k=0}^{S-1} p_S(k) \int_{-\infty}^{\infty} Dz\,[B_+ G_+ - B_- G_-], \\
M &= \sum_{k=0}^{S-1} p_S(k) \int_{-\infty}^{\infty} Dz\,[B_+ G_+ + B_- G_-], \\
m_s &= \sum_{k=0}^{S-1} p_S(k)\,\frac{x_s}{S-1} \int_{-\infty}^{\infty} Dz\,[B_+ G_+ + B_- G_-],
\end{aligned}
\tag{11}
$$

for the retrieval, categorization and quasi-symmetric overlap, respectively. Here

$$ G_\pm = \mathrm{sign}[x_s m_s \pm m + z\sqrt{\alpha r}], \tag{12} $$



with $x_s = \sum_{\rho=2}^{S} \lambda^{1\rho} = 2k - (S - 1)$, and the averages are over the
remaining $S - 1$ examples from the first concept, and the remaining $p - 1$ concepts. The
first is the binomial variable $x_s = 2k - (S - 1)$, distributed according to

$$ p_S(k) = C_k^{S-1} B_+^{k} B_-^{S-1-k}, \tag{13} $$

the last is a Gaussian noise, distributed according to

$$ Dz = \frac{dz}{\sqrt{2\pi}}\,e^{-z^2/2}. \tag{14} $$



In the present case of a fully-connected network, there is a strong feedback in the
dynamics, but an expression for the variance of the noise can be obtained using a replica
symmetric approach [?],

$$ r = S\,\frac{[1 - C(1 - b^2)(1 - b^2 + S b^2)]^2 + (S - 1) b^4}{[1 - C(1 - b^2)]^2\,[1 - C(1 - b^2 + S b^2)]^2}, \tag{15} $$

with

$$ C = \frac{1}{\sqrt{\alpha r}} \sum_{k=0}^{S-1} p_S(k) \int_{-\infty}^{\infty} Dz\,z\,[B_+ G_+ + B_- G_-]. \tag{16} $$

We solve Eqs. (11)-(16) and then insert the overlaps into the expressions for the
information, Eqs. (7) and (10). These analytical results for the information
are then presented in comparison with the results from simulation.
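The stationary equations can be iterated numerically. The sketch below is a naive fixed-point iteration, not necessarily the procedure used by the authors. It relies on two standard Gaussian identities, $\int Dz\,\mathrm{sign}(a + z\sigma) = \mathrm{erf}(a/\sigma\sqrt{2})$ and $\int Dz\,z\,\mathrm{sign}(a + z\sigma) = \sqrt{2/\pi}\,e^{-a^2/2\sigma^2}$, and it codes the noise variance in the form $r = S\{[1 - C\lambda_1\lambda_2]^2 + (S-1)b^4\}/\{[1 - C\lambda_2]^2[1 - C\lambda_1]^2\}$ with $\lambda_1 = 1 - b^2 + Sb^2$ and $\lambda_2 = 1 - b^2$; the overall factor $S$, the starting point and the name `solve_overlaps` are assumptions of this sketch.

```python
import math

def solve_overlaps(alpha, b, S, n_iter=200):
    """Naive fixed-point iteration for (m, M, m_s, r), started near the retrieval state."""
    Bp, Bm = (1 + b) / 2, (1 - b) / 2
    n = S - 1
    # Pairs (x_s, p_S(k)) with x_s = 2k - (S - 1) and binomial weights p_S(k).
    terms = [(2 * k - n, math.comb(n, k) * Bp**k * Bm**(n - k)) for k in range(n + 1)]
    lam1, lam2 = 1 - b**2 + S * b**2, 1 - b**2
    m, M, ms, r = 1.0, b, b**2, float(S)
    for _ in range(n_iter):
        sd = math.sqrt(alpha * r)                 # standard deviation of the Gaussian noise
        new_m = new_M = new_ms = C = 0.0
        for x, w in terms:
            ap, am = x * ms + m, x * ms - m       # arguments of G_+ and G_- without noise
            gp = math.erf(ap / (sd * math.sqrt(2)))   # Gaussian average of G_+
            gm = math.erf(am / (sd * math.sqrt(2)))   # Gaussian average of G_-
            new_m += w * (Bp * gp - Bm * gm)
            new_M += w * (Bp * gp + Bm * gm)
            if n > 0:
                new_ms += w * (x / n) * (Bp * gp + Bm * gm)
            C += w * math.sqrt(2 / math.pi) / sd * (
                Bp * math.exp(-ap**2 / (2 * sd**2)) + Bm * math.exp(-am**2 / (2 * sd**2)))
        r = S * ((1 - C * lam1 * lam2) ** 2 + (S - 1) * b**4) / (
            (1 - C * lam2) ** 2 * (1 - C * lam1) ** 2)
        m, M, ms = new_m, new_M, new_ms
    return m, M, ms, r

# Hopfield limit (S = 1, b = 1) at a small load: the example is the concept, so m = M.
print(solve_overlaps(alpha=0.05, b=1.0, S=1))
```

A plain iteration like this one may oscillate or diverge near phase boundaries, where damping or a root finder would be more appropriate; the converged overlaps are what would then be inserted into the information formulas.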



B. Simulation 

The simulations we have performed are for networks of $N = 5{,}000$ and $N = 10^4$ neurons,
which are updated in parallel according to the dynamics in Eq. (1), up to $t = 10$ time
steps, or until the overlaps converge. Thus we have almost stationary states in most cases,
except when a state of no information is obtained, for which the convergence times are
typically much larger.
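The stopping rule described above can be sketched as follows. To keep memory use low for large $N$, this version never builds the $N \times N$ coupling matrix and instead computes the Hebbian field directly from the stored patterns. The convergence tolerance `tol`, the function name `run_dynamics` and the small test case with independent patterns are assumptions chosen for illustration, not the settings of the reported simulations.

```python
import numpy as np

def run_dynamics(patterns, sigma0, target, max_steps=10, tol=1e-6):
    """Parallel sign dynamics with early stopping once the tracked overlap converges.

    patterns: (P, N) array of +/-1 stored examples
    sigma0:   (N,) initial state
    target:   (N,) pattern whose overlap with the state is monitored
    """
    N = patterns.shape[1]
    sigma = sigma0.copy()
    overlap = target @ sigma / N
    for step in range(1, max_steps + 1):
        # Hebbian local field without forming J: h = (1/N) * sum_P eta_P (eta_P . sigma).
        h = patterns.T @ (patterns @ sigma) / N
        sigma = np.where(h >= 0, 1, -1)
        new_overlap = target @ sigma / N
        if abs(new_overlap - overlap) < tol:
            return sigma, new_overlap, step
        overlap = new_overlap
    return sigma, overlap, max_steps

rng = np.random.default_rng(0)
stored = rng.choice([-1, 1], size=(5, 2000))      # a few independent patterns
state, overlap, steps = run_dynamics(stored, stored[0], stored[0])
print(overlap, steps)
```

At such a low load the starting pattern is expected to be a fixed point, so the loop stops after the first step; in a regime of slow convergence the cap `max_steps` is what ends the run.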

The capacity is analyzed as a function of the two loading parameters of the network: the
rate of loading of concepts, $\alpha = p/N$, and the number of examples per concept, $S$.
The sample averages are taken over an interval in $\ln(S)$ or in $\alpha$. When simulating
the information as a function of $S$, we first generate the concepts and then store
consecutively the examples of each concept. When simulating the information as a function
of $\alpha$, we generate the $S$ examples of the concept generated at each step of the
learning.

The network is then trained by storing examples, while the retrieval and categorization
overlaps are monitored. For a fixed $\alpha$, it is expected that on increasing $S$ the
network passes from a regime where the retrieval information is large to another where the
categorization information increases up to saturation at an upper bound. This behavior is
seen in Fig. 1, where the overlaps, as well as the information, are plotted as functions of
$\ln(S)$, with a correlation $b = 0.3$, for a loading of concepts $\alpha = 0.01$. When
more and more examples are learned, the retrieval information increases up to a maximum at
$S_R = 7$, then it falls. After an interval in which no information is transmitted, the
network reaches, at $S_C \approx 33$, the categorization phase, where the categorization
information jumps to a higher value. It continues to increase until it saturates at
$i_C = 0.01$, when the network reaches $M \approx 1$ after $S \approx 90$. The retrieval
information capacity of the network is $i_R \approx 0.06$. The asymptotic theory for
$N \to \infty$ fits the simulation for $N = 10^4$ quite well, except in the region of no
information. This is due to the finite number of steps used in the dynamical simulation,
$t = 10$, while the convergence to the fixed point there is very slow.

A case with a larger load of concepts, $\alpha = 0.04$, is plotted in Fig. 2. Although now
the network can only retrieve the examples well up to $S = 3$, it has $i_R \approx 0.10$.
Then there is a long waiting period where the information stays close to zero, up to
$S_C \approx 74$, when the categorization information jumps to $i_C \approx 0.04$, which is
much larger than in the case $\alpha = 0.01$.

Comparing this with a network with larger correlation, $b = 0.4$, plotted in Fig. 3, we
observe that the network can store with a large overlap only $S_R = 2$ examples, with a
maximal retrieval information $i_R \approx 0.05$, which is somewhat smaller than the naive
$S\alpha \approx 0.08$. However, the categorization information approaches its saturation
value $i_C \approx 0.04$ much faster: only $S \approx 30$ examples must be learned. We have
checked that for a larger load of concepts ($\alpha > 0.06$) the categorization information
is larger than the retrieval information. We also verified that for higher correlations
($b > 0.6$) the categorization information can be the larger one, even for a small load
$\alpha \approx 0.01$, while for smaller correlations ($b < 0.2$) the retrieval information
is always the larger one.

For a fixed $S$, one expects that on increasing $\alpha$ the categorization information (if
$b$ or $S$ are large enough) increases up to a maximum value, after which it decreases
until it becomes zero at a critical $\alpha$. This behavior can be seen in Fig. 4, where
the case $b = 0.2$, $S = 170$ is plotted. We verified that the larger the value of $b$, the
higher the maximum of $i_C$, and the fewer examples are needed. We also observed that the
retrieval information has a similar non-monotonic behavior if $b$ or $S$ are small.



V. CONCLUSION

The information conveyed by the categorization model was studied. It was shown that the
transition from the retrieval phase to the categorization phase is accompanied by a
transition in the information: the retrieval information decreases when the network is
oversaturated with examples and, after a resting period, the categorization information
increases.

It is interesting to note that, although neither the retrieval nor the categorization
information surpasses that of the usual Hopfield model ($S = 1$, $b = 1$), which is
$i_R \approx 0.13$ at $\alpha = 0.135$, the fact that the network can return to behave as
an associative memory after a long resting period, $S_R < S < S_C$, is an advantage with
respect to the Hopfield network. It is also worth noting that the retrieval information can
still be relatively large, as we see in Fig. 2, a feature which was not observed before in
any work about the categorization model in the literature.

The simulation results fit the theoretical ones very well in both the retrieval and the
categorization regimes, showing that almost no finite-size effect is present, but the
convergence time in the resting period must be much larger than that used in this work.

The expressions for the information of the retrieval and of the categorization in
Eqs. (10) and (7) are not claimed to be exact. They are approximations for a more precise
measure, the mutual information [13] between neuron and patterns,
$\mathcal{I}[\sigma, \xi] = H[\xi] - \langle H[\sigma|\xi] \rangle_{\xi}$, where
$H[\sigma|\xi]$ is the conditional entropy. Since we know that the conditional probability
of the neuron, given the concept state, is
$p(\sigma|\xi) = (1 + M\sigma\xi)\,\delta(|\sigma|^2 - 1)$, we can replace the
categorization information by

$$ \mathcal{I}[\sigma, \xi] = \frac{1 + M}{2}\log_2(1 + M) + \frac{1 - M}{2}\log_2(1 - M). \tag{17} $$

This quantity gives the degree of information the neuron can catch from the concept.
However, we prefer to use the estimate in Eq. (7), so that it can be compared with the
retrieval information at the same level of precision.

Finally, we hope that the present approach to the information content of a neural network
of correlated patterns can be used in the context of more general architectures and
learning rules. A more general distribution of the $\lambda_i^{\mu\rho}$ [?] may also
deserve some attention.
FIG. 1. The overlaps (top) and information (bottom) as functions of $\ln(S)$, for
$b = 0.3$ and $\alpha = 0.01$. The squares (circles) are the simulation results for
retrieval (categorization) for $N = 10^4$, $t = 10$; the dashed (solid) curves are the
asymptotic results.
FIG. 2. Same as Fig. 1, for $b = 0.3$ and $\alpha = 0.04$.
FIG. 3. Same as Fig. 1, for $b = 0.4$ and $\alpha = 0.04$.
FIG. 4. The categorization information as a function of $\alpha$, for $b = 0.2$ and
$S = 170$. Asymptotic (solid) and simulation for $N = 5{,}000$ (dashed).
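The relation between the mutual information of Eq. (17) and the quadratic estimate $M^2$ underlying Eq. (7) can be explored numerically. The sketch below assumes the explicit binary form $\mathcal{I} = \tfrac{1+M}{2}\log_2(1+M) + \tfrac{1-M}{2}\log_2(1-M)$, which follows from a neuron that agrees with the concept with probability $(1+M)/2$; the function name is hypothetical.

```python
import math

def mutual_information(M):
    """Mutual information (bits) between a binary neuron and a binary concept at overlap M."""
    total = 0.0
    for x in (1 + M, 1 - M):
        if x > 0:                      # the limit x*log2(x) -> 0 as x -> 0 is used
            total += 0.5 * x * math.log2(x)
    return total

# Compare with the quadratic estimate M^2 on a few overlaps.
for M in (0.0, 0.3, 0.6, 0.9, 1.0):
    print(M, mutual_information(M), M**2)
```

Both measures vanish at $M = 0$, reach one bit at $M = \pm 1$ and are symmetric in $M$; in between, the quadratic estimate lies above the mutual information, so it is a monotonic but not exact substitute.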